
Fix: AICore Phase 1 stale L1 cache read on aicpu_ready#595

Closed
yanghaoran29 wants to merge 1 commit intohw-native-sys:mainfrom
yanghaoran29:fix_second_error

Conversation

yanghaoran29 (Contributor) commented Apr 20, 2026

Without rtDeviceReset between runs (removed in 8be358b), the first load of aicpu_ready in aicore_execute()'s Phase 1 spin can L1-hit a residual 1 from a prior run. The while-loop body's dcci never runs, AICore proceeds before AICPU finishes init, and downstream task dispatch is scrambled, producing ~5% precision failures on paged_attention Case1 -> small1 (max_diff ~ 0.3).

Fix by invalidating our Handshake L1 cache line at AICore kernel exit, right after the existing flush (CACHELINE_OUT). Placing the invalidate at the writer's exit (instead of the reader's entry) keeps state reset self-contained within the case that produced it, so the next case's first load of aicpu_ready is guaranteed to miss L1 and read HBM.

Applied identically to all five AICore executors (a2a3 host_build_graph / aicpu_build_graph / tensormap_and_ringbuffer; a5 host_build_graph / tensormap_and_ringbuffer).

Validated: 100/100 PASS on host_build_graph paged_attention (was ~5% fail).


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces L1 cache invalidation at the entry point of the aicore_execute function across multiple runtime implementations. This change is designed to prevent stale cache data from previous kernel runs from incorrectly satisfying the AICPU initialization handshake. The review feedback consistently suggests using dci (Invalidate) instead of dcci (Clean and Invalidate) to eliminate the risk of stale data being written back to memory and potentially overwriting initialization values from the Host or AICPU.

Comment thread src/a2a3/runtime/aicpu_build_graph/aicore/aicore_executor.cpp Outdated
Comment thread src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp Outdated
Comment thread src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp Outdated
yanghaoran29 (Contributor, Author) commented Apr 20, 2026

Bug Fix: AICore Phase 1 Stale L1 Cache Read

Symptom

When running the TestPagedAttentionHostBuildGraph suite, the pair Case1 → small1 (run sequentially in the same process) produces intermittent precision failures on small1:

  • CI: observed at least 3 times on commits that did not touch any TestPagedAttentionHostBuildGraph-related code — the same case failed with a precision error.
  • Running tests/st/a2a3/host_build_graph/paged_attention/test_paged_attention.py: Case1 always passes, small1 fails ~5% of the time.
  • Running only small1 from the same file: 100% PASS.
  • Running only Case1 from the same file: 100% PASS.
  • Adding log statements around the Host Build Graph Handshake and rerunning the same file: both Case1 and small1 pass every time — a textbook cache / timing race signature.

Relevant recent change

The only recent change to HostBuildGraph is commit 8be358b, which switched the testing model from "load the .so once, run one case" to "load the .so once, run multiple cases." This strongly suggests that state from the first case leaks into the second.

Root cause

In aicore_executor.cpp::aicore_execute(), the Phase 1 spin on aicpu_ready is written as:

__gm__ Handshake *my_hank = &runtime->workers[block_idx];

// Phase 1: Wait for AICPU initialization signal
while (my_hank->aicpu_ready == 0) {
    dcci(my_hank, SINGLE_CACHE_LINE);
}

The execution order of this loop is:

load aicpu_ready  →  check  →  dcci  →  load aicpu_ready  →  check  →  dcci  →  ...

Because commit 8be358b removed the rtDeviceReset that used to run between executions, the cache line for the Handshake struct left in the AICore L1 by the previous aicore_execute invocation (in the same process) can carry over into this one. If the first load of aicpu_ready happens to L1-hit a residual 1 from the previous run, the while condition is false on the first iteration — the loop exits without ever calling dcci. AICore then proceeds into Phases 2 and 3 while the AICPU of the current run has not actually finished init, scrambling downstream task dispatch and producing wrong accumulator values for some heads.

Direct evidence: a snapshot probe placed at AICore kernel entry (before any dcci) observed values like 3 or 7 in Handshake::aicpu_ready on some workers. These are neither legal initial states nor in-protocol transient states — they are bytes left in L1 by the previous run.

Why single-case runs are fine but "load-once + multi-case" isn't

This is the same root cause showing up differently under two execution modes, and it hinges on one fact: AICore L1 cache lifetime is tied to .so load / process lifetime, not to a single aicore_execute invocation.

Mode A: one case per process

  • The test harness spawns a fresh process for each case, dlopens the runtime .so once, and issues exactly one aicore_execute.
  • When the process starts, AICore's L1 cache is effectively clean from this process's perspective — it cannot contain a cache line for the Handshake address yet, because this process has never touched that GM region.
  • The first Phase 1 load of aicpu_ready is guaranteed to L1-miss and fall through to HBM, picking up whatever the AICPU actually wrote (initially 0, then 1 once AICPU init completes).
  • The spin runs normally; the dcci → load → check cadence relies on miss-driven refresh exactly as designed. The bug is simply not reachable.
  • Hence small1 alone and Case1 alone are both 100% PASS.

Mode B: load once, run multiple cases sequentially (the failure path)

  • The test framework reuses a single Python process. The .so is loaded once, and Case1 and small1 both trigger aicore_execute inside the same process.
  • When Case1 finishes, AICPU has already written aicpu_ready = 1 into the Handshake, and the corresponding AICore L1 cache line has been loaded / read, leaving it valid with contents 1.
  • Because commit 8be358b removed the rtDeviceReset between runs, there is no forced action on the run boundary that invalidates AICore L1:
    • no hardware reset;
    • no explicit cache flush / invalidate;
    • AICore L1 line eviction is driven purely by capacity and replacement policy, so the line is not guaranteed to be evicted between two runs.
  • When small1's aicore_execute starts, the Handshake cache line may still be valid and still hold the 1 left over from Case1 (or some garbage byte — see the 3 / 7 values in the snapshot evidence).
  • The first Phase 1 load is an L1 hit on this stale non-zero value. The while is false on the first iteration, so the loop body's dcci never runs. AICore skips the handshake and enters Phases 2/3 while small1's AICPU is still initializing.
  • Whether the line is hit depends on L1 replacement behavior and the GM access pattern between the two runs — that's why failure is intermittent (~5%) rather than deterministic.

Side-by-side

| Dimension | Single case (own process) | Multiple cases in one process (shared .so) |
| --- | --- | --- |
| Residual Handshake in L1 | No (process has never touched it) | Yes (left by previous aicore_execute) |
| First Phase 1 load | Must miss → reads HBM | May hit → reads previous run's 1 |
| Phase 1 spin dcci | Executes, refreshes the line | Skipped; never gets a chance to refresh |
| AICPU init guaranteed complete? | Always (we wait) | Not guaranteed; may proceed early |
| Observed result | 100% PASS | ~5% intermittent precision failure |

In other words: the Case1 → small1 failure is not a bug in small1; it is L1 state from Case1 bleeding into small1, compounded by the "no rtDeviceReset between runs" precondition. This also explains why every workaround that perturbs the L1 state at small1 entry (extra logging, usleep, additional GM stores) magically makes the test green.

Why "binaries don't unload" is a precondition for cross-case residue

Ascend uses a three-part .so/.o deployment model: at process start, the Host .so is dlopened once; the AICPU .so and AICore .o are registered once with the device via rtDevBinaryRegister (or an equivalent runtime API), producing long-lived binary handles. Every subsequent rtKernelLaunch is just an invoke of the already-resident device function — the binary is not reloaded. Only at process exit are the binaries Unregistered / dlclosed.

The per-case lifecycle looks like this:

process start:
  dlopen(host.so)
  rtDevBinaryRegister(aicpu.so / aicore.o)   ← binaries land on device

Case1:
  rtKernelLaunch(aicore_execute)             ← just an invoke
    AICore: enter aicore_execute
            Phase 1/2/3 handshake → main loop dispatches tasks
            sees AICORE_EXIT_SIGNAL → break
            flush + invalidate → return      ← kernel instance ends, AICore idle

  (AICore executes nothing during this gap; Host compares goldens, prepares Case2 Runtime)

Case2:
  rtKernelLaunch(aicore_execute)             ← invoke the same code again
    AICore: re-enter aicore_execute from line 1
            first load of aicpu_ready        ← this is where it matters
            ...

process exit:
  rtDevBinaryUnregister / dlclose            ← this is when binaries actually unload

Consequences:

  • aicore_execute does return between cases. Stack locals (my_hank, task_id, last_task_id) are re-initialized on the next invocation; nothing carries over via the call stack.
  • But the device binary does not unload. The next launch runs the same device code against the same HBM addresses for Runtime / Handshake — virtual and physical addresses are both unchanged.
  • Because the address is unchanged and there's no rtDeviceReset, the L1 cache line that corresponds to that address in the physical core is not passively invalidated.
  • So the bug's residue lives purely at the hardware cache layer. It has nothing to do with C++ locals, call stacks, or binary contents — the function had already returned; only the L1 line was still alive.
  • Conversely, if each case reloaded its .o, the new binary would likely land at a different address, and Runtime would be re-allocated — L1 would not even have a chance to hit. That matches the Mode A single-case scenario above.

In short, multiple cases sharing a single already-loaded binary is one of the necessary preconditions that allows L1 residue to survive across cases.

Fix

Add one dcci(my_hank, SINGLE_CACHE_LINE) (invalidate-only) at the end of the AICore kernel, right after the existing flush (CACHELINE_OUT). Once the current case exits, the Handshake cache line on this core is invalidated; the next case entering aicore_execute on the same core is guaranteed to L1-miss on its first load of aicpu_ready and fall through to HBM for the real value.

// ... execution done, about to exit ...

// Flush all dirty cache lines to HBM before kernel exit.
dcci(my_hank, SINGLE_CACHE_LINE, CACHELINE_OUT);

// Invalidate our Handshake L1 line on exit so the next case on this core
// sees a fresh aicpu_ready=0 on its first load instead of an L1-resident 1
// left over from this case (no rtDeviceReset between cases).
dcci(my_hank, SINGLE_CACHE_LINE);

Two dcci forms:

  • With CACHELINE_OUT — flush only: pushes dirty data written in this case (e.g. aicore_done) out to HBM so downstream readers (host / AICPU) can see the result. This is the original semantics and must be preserved.
  • Without CACHELINE_OUT — invalidate only: marks the line invalid in this core's L1. The newly added call is precisely for clearing a lingering aicpu_ready = 1 in L1.

Why append at exit instead of prepend at entry: every Handshake[i] on AICore is always read by the same core i — the "who dirties it / who cleans it" mapping is 1:1 in hardware. Having the writer clean up on exit and having the next reader clean up on entry are functionally equivalent on the normal path. Putting the invalidate at exit keeps "this case ends with state reset" self-contained within the function that produced the state, leaving no L1 residue for the next invocation.
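For reference, the reader-entry variant that this paragraph calls functionally equivalent would look like the sketch below. This is device pseudocode, not part of this patch (the patch keeps the invalidate at exit); it only shows where the entry-side dcci would sit relative to the existing Phase 1 spin.

```cpp
// Equivalent alternative (NOT what this patch does): invalidate at the
// reader's entry, before the first Phase 1 load, so the first load is
// guaranteed to miss L1 even if a prior case left the line valid.
__gm__ Handshake *my_hank = &runtime->workers[block_idx];

dcci(my_hank, SINGLE_CACHE_LINE);      // drop any residual line first

// Phase 1: Wait for AICPU initialization signal
while (my_hank->aicpu_ready == 0) {
    dcci(my_hank, SINGLE_CACHE_LINE);
}
```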

Files changed

The same Phase 1 pattern exists in all five AICore executors. Each gets the same patch (4 lines appended after the existing exit flush: 3 comment lines + 1 invalidate dcci):

  • src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp
  • src/a2a3/runtime/aicpu_build_graph/aicore/aicore_executor.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
  • src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp

Validation

run_repro.sh runs Case1 → small1 N times. After the fix (host_build_graph only):

=== Summary [all] ===
total=100 pass=100 fail=0 timeout=0

100/100 PASS across two independent runs.

Scope and follow-ups

  • This patch is a point fix for the stale-read problem on AICore Phase 1 handshake fields.
  • Commit 8be358b removed the rtDeviceReset between runs. A more global defense would be to reintroduce a lightweight cache flush on run boundaries (e.g. invalidate the entire Runtime address range at the end of deinit or at the start of the next init). If similar intermittent failures reappear, the first thing to investigate is whether state from the first case is bleeding into the second.

@ChaoWao ChaoWao marked this pull request as draft April 20, 2026 07:16